Abstract

We describe the construction of a high performance parallel computer composed of PC components and present some physical results for light hadron and hybrid meson masses from lattice QCD. We also show that the smearing technique is very useful for improving the spectrum calculations.

1 Introduction

The interests of the computational physics and high energy physics group at Zhongshan University (ZSU) cover such topics as lattice gauge theory [1, 2, 3, 4, 5, 6, 7, 8, 9], supersymmetry [10], quantum instantons [11] and quantum chaos [12, 13]. All of these topics can be investigated through Monte Carlo simulation, which can be quite costly in terms of computing power. To carry out large scale numerical investigations of these topics, we require a corresponding development of our local computing resources.

The last two decades have ushered in the computer revolution for the consumer. In this period 
computers have moved from the domain of large companies, universities, and governments, to 
private homes and small businesses. As computational power has become more accessible, our 
demands and expectations for this power have increased accordingly. 

We demand an ever-increasing amount of computational ability for business, communication, entertainment, and scientific research. This rapid rise in both the demand for computational ability and the capability itself has forced a continual redefinition of the concept of a "super computer." The computational speed and ability of household computers now surpass those of the computers which helped guide men to the moon. The demarcation between super computers and personal computers has been further blurred in recent years by the high speed and low price of modern CPUs and networking technology and the availability of low cost or free software. By combining these three elements, all readily available to the consumer, one can assemble a true super computer that is within the budget of small research labs and businesses. This type of cluster is generally termed a Beowulf class computer. The idea was originally developed as a project at the US National Aeronautics and Space Administration [14].

We document the construction of a cluster of PCs configured for parallel processing, and show its performance in lattice QCD simulations. We also present some results for the hadron masses from lattice QCD.

2 Construction of a Parallel Cluster 

2.1 Computational Hardware 

We built a cluster of 10 PC type computers, all the components of which we purchased at normal consumer outlets for computer equipment. The major difference between our computers and those likely to be found in a home or business is that each is equipped with two CPUs. This allows us to roughly double our processing power without the overhead cost of extra cases, power supplies, network cards, etc. Specifically, we have installed two 500MHz Pentium III processors in each motherboard. For the purposes of this report we will describe each computer as one "node" in the cluster; i.e., a node has two processors. Each node has its own local EIDE hard disk, in our case of 10GB. This amount of space is not necessary, as the operating system requires less than one gigabyte per node. However, the price of IDE hard disks has dropped so rapidly that this seems a reasonable way to add supplementary storage space to the cluster. Furthermore, each node is equipped with memory (at least 128MB), a display card, a 100Mbit/s capable fast Ethernet card, a CDROM drive and a floppy drive. These last two items are not an absolute necessity, as installation can be done over the network, but they add a great deal of convenience and versatility for a very modest cost.

One node is special and equipped with extra or enhanced components. The first node acts as a 
file server and has a larger (20GB) hard disk. This disk is the location of all the home directories 
associated with user accounts. The first node also has a SCSI adapter, for connecting external 
backup devices such as a tape drive. 

What each computer does not have is a monitor, keyboard, and mouse. Monitors can easily 
be one of the most expensive components of a home computer system. For a cluster such as this 
one, the individual nodes are not intended for use as separate workstations. Most users access the 
cluster through network connections. We use a single console (one small monitor, a keyboard and 
mouse) for administrative tasks. It is handy when installing the operating system on a new node. 
In this situation we move the console cables to the particular node requiring configuration. Once 
we have installed communications programs such as telnet and ssh, it is almost never necessary 
to move the monitor and cables to the subordinate nodes. 

2.2 Communications Hardware 

There are many options for networking a cluster of computers, including various types of switches and hubs, cables of different types and communication protocols. We chose to use fast Ethernet technology, as a compromise between budget and performance demands. As stated above, we equipped each node with a 100Mbit/s capable fast Ethernet card. A standard Ethernet hub cannot accommodate simultaneous communications between two separate pairs of computers, so we use a fast Ethernet switch. This is significantly more expensive than a hub, but necessary for parallel computation involving large amounts of inter-node communication. We found a good choice to be a Cisco Systems 2900 series switch. For ten nodes a bare minimum is a 12 port switch: one port for each node plus two spare ports for connecting either workstations or an external network. We have in fact opted for a 24 port switch to leave room for future expansion of the cluster as our budget permits.

100Mbit per second communication requires higher quality "Category-5" Ethernet cable, so we use this for the connections between the nodes and the switch. It should be noted that while a connection can be made from one of the switch ports to an external Internet router, this cable must be a "crossover" cable with the input and output wire strands switched. The general layout of the cluster hardware is shown in Figure 1.

2.3 Software 

For our cluster we use the Linux open source UNIX-like operating system. Specifically, we have 
installed a Redhat Linux distribution, due to the ease of installation. The most recent Linux kernel 
versions automatically support dual CPU computers. Linux is also able to support a Network 
File System (NFS), allowing all of the nodes in the cluster to share hard disks, and a Network 
Information System (NIS), which standardizes the usernames and passwords across the cluster. 

The one precaution one must take before constructing such a cluster is to ensure that the hardware components are compatible with Linux. The vast majority of PC type personal computers in the world are running a Windows operating system, and hardware manufacturers usually write only Windows device drivers. Drivers for Linux are usually in the form of kernel modules and are written by Linux developers. As this is a distributed effort, shared by thousands of programmers worldwide, often working as volunteers, not every available PC hardware component is immediately compatible with Linux. Some distributions, such as Redhat, have the ability to probe the hardware specifications during the installation procedure. It is rather important to check on-line lists of compatible hardware, particularly graphics cards and network cards, before purchasing hardware. We began by purchasing one node and checking its compatibility with the operating system before purchasing the rest of the nodes.

To provide parallel computing capability, we use a Message Passing Interface (MPI) implementation. MPI is a standard specification for message passing libraries [15]. Specifically we use the mpich implementation, which is available for free download over the world wide web [16]. An MPI implementation is a collection of software that allows communication between programs running on separate computers. It includes a library of supplemental C and FORTRAN functions to facilitate passing data between the different processors.



3 Basic Ideas of Lattice QCD 

Our main purpose for building the PC cluster is to do large scale lattice Quantum Chromodynamics (QCD) simulations. The basic idea of lattice gauge theory [18], as proposed by K. Wilson in 1974, is to replace continuous space and time by a discrete grid:

[Figure: a lattice of extent L, showing a site, a link and a plaquette.]

Gluons live on links, $U_\mu(x) = e^{ig\int_x^{x+\hat\mu a} dx'\,A_\mu(x')}$, and quarks live on lattice sites. The continuum Yang-Mills action $S_g = \int d^4x\,\mathrm{Tr}\,F_{\mu\nu}(x)F_{\mu\nu}(x)/2$ is replaced by

$$S_g = -\frac{\beta}{6}\sum_p \mathrm{Tr}\left(U_p + U_p^\dagger - 2\right), \qquad (1)$$

where $\beta = 6/g^2$, and $U_p$ is the ordered product of link variables $U$ around an elementary plaquette. The continuum quark action $S_q = \int d^4x\,\bar\psi^{cont}(x)(\gamma_\mu D_\mu + m)\psi^{cont}(x)$ is replaced by

$$S_q = \sum_{x,y}\bar\psi(x)M_{x,y}\psi(y). \qquad (2)$$

For Wilson fermions, the quark field $\psi$ on the lattice is related to the continuum one $\psi^{cont}$ by $\psi = \psi^{cont}\sqrt{a^3/(2\kappa)}$ with $\kappa = 1/(2ma + 8)$. $M$ is the fermionic matrix:

$$M_{x,y} = \delta_{x,y} - \kappa\sum_{\mu=1}^{4}\left[(1-\gamma_\mu)U_\mu(x)\delta_{x,y-\hat\mu} + (1+\gamma_\mu)U_\mu^\dagger(x-\hat\mu)\delta_{x,y+\hat\mu}\right]. \qquad (3)$$

For Kogut-Susskind fermions, the fermionic matrix is given by

$$M_{x,y} = ma\,\delta_{x,y} + \frac{1}{2}\sum_{\mu=1}^{4}\eta_\mu(x)\left[U_\mu(x)\delta_{x,y-\hat\mu} - U_\mu^\dagger(x-\hat\mu)\delta_{x,y+\hat\mu}\right], \qquad \eta_\mu(x) = (-1)^{x_1+x_2+\cdots+x_{\mu-1}}. \qquad (4)$$



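Physical quantities are calculable through Monte Carlo (MC) simulations with importance sampling. Fermion fields must be integrated out before the simulations, leading to

$$\langle F\rangle = \frac{\int [dU]\,\bar F([U])\,e^{-S_g([U])}\det M([U])}{\int [dU]\,e^{-S_g([U])}\det M([U])} \approx \frac{1}{N_{config}}\sum_{Conf}\bar F[Conf]. \qquad (5)$$

Here $\bar F$ is the operator after Wick contraction of the fermion fields and the summation is over the gluonic configurations, Conf, drawn from the Boltzmann distribution. In the quenched approximation, $\det(M) = 1$.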

We introduce the u and d quark propagators

$$Q_{s_1c_1,s_2c_2}(x,y) = M^{-1}[U,\kappa=\kappa_{ud}]_{x s_1 c_1,\,y s_2 c_2}, \qquad (6)$$

where the Dirac and color indices are explicitly written. In general, most of the computer time in the simulation of hadron masses or dynamical quarks is spent on computing the quark propagators. Usually this is accomplished by an inversion algorithm that solves systems of linear equations.
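The inversion step can be illustrated with a minimal conjugate gradient sketch. This is not the MILC routine: the small dense matrix `A` is only a stand-in for the Hermitian positive-definite operator $M^\dagger M$, and the function name, tolerance and test values are assumptions chosen for illustration.

```python
def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive-definite matrix A (list of rows)."""
    n = len(b)
    x = [0.0] * n
    r = list(b)                  # residual r = b - A x, starting from x = 0
    p = list(r)                  # initial search direction
    rr = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if rr ** 0.5 < tol:      # stop once the residual norm is small enough
            break
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rr / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rr_new = sum(ri * ri for ri in r)
        p = [r[i] + (rr_new / rr) * p[i] for i in range(n)]
        rr = rr_new
    return x

# Toy stand-in for M^dagger M (assumed values, not a lattice operator).
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x = conjugate_gradient(A, b)
residual = [sum(A[i][j] * x[j] for j in range(3)) - b[i] for i in range(3)]
```

In a real simulation the matrix is never stored densely; only the matrix-vector product is implemented, and that product is the part that gets parallelized.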

To compare with the real world, the continuum limit $a \to 0$ should eventually be taken. On the other hand, to keep the physical volume $(La)^3$ unchanged, the number of spatial lattice sites $L^3$ should be very large. To reliably measure the effective mass of a hadron, one also has to increase the number of temporal lattice sites T accordingly. Therefore, the computational task is tremendously increased. As such, it is well suited for parallelization. A parallel lattice QCD algorithm divides the lattice into sections and assigns the calculations relevant to each section to a different processor. Near the boundaries of the lattice sections, information must be exchanged between processors. However, since the calculations are generally quite local, the inter-processor communication is not extremely large.
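The division into sections with boundary exchange can be sketched on a one-dimensional periodic lattice. The example below uses plain Python lists to emulate the processors, so no MPI calls are involved; the function names and the nearest-neighbor stencil are hypothetical and serve only to show that each section needs just one boundary site from each neighbor.

```python
def neighbor_sum_serial(field):
    """For each site, add its two nearest neighbors on a periodic 1D lattice."""
    n = len(field)
    return [field[(i - 1) % n] + field[(i + 1) % n] for i in range(n)]

def neighbor_sum_decomposed(field, n_proc):
    """Same stencil, with the lattice split into n_proc equal sections."""
    n = len(field)
    size = n // n_proc           # assumes n is divisible by n_proc
    sections = [field[k * size:(k + 1) * size] for k in range(n_proc)]
    result = []
    for k, local in enumerate(sections):
        # "Communication": one boundary site is fetched from each neighbor.
        left_halo = sections[(k - 1) % n_proc][-1]
        right_halo = sections[(k + 1) % n_proc][0]
        padded = [left_halo] + local + [right_halo]
        # Purely local computation on the padded section.
        result += [padded[i - 1] + padded[i + 1] for i in range(1, size + 1)]
    return result

field = [float(i * i % 7) for i in range(16)]
serial = neighbor_sum_serial(field)
parallel = neighbor_sum_decomposed(field, n_proc=4)
```

The decomposed result matches the serial one, while each section only ever reads two values that it does not own.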

4 Performance and Cost 

We ran a standard LINPACK benchmark test and determined the peak speed of a single 500MHz Pentium III processor. The results of this test, shown in Table 1, are about 100 million floating point operations per second (Mflops).

Table 1: Results of LINPACK benchmark test on a single CPU.

Precision    Mflops
single       86 - 114
double       62 - 68

Interface directions   Hypercubes (CPUs)   Lattice volume            Total interface   Interface / CPU
j                      $2^j$               $L^{4-j} \times (2L)^j$   $2^j j L^3$       $j L^3$
0                      1                   $L^4$                     0                 0
1                      2                   $L^3 \times 2L$           $2L^3$            $L^3$
2                      4                   $L^2 \times (2L)^2$       $8L^3$            $2L^3$
3                      8                   $L \times (2L)^3$         $24L^3$           $3L^3$
4                      16                  $(2L)^4$                  $64L^3$           $4L^3$

Table 2: Summary of boundary sizes for division of a lattice into 1, 2, 4, 8 and 16 hypercubes of size $L^4$.

With the single-CPU figure in mind, the theoretical upper limit for the aggregate speed of the whole cluster (20 CPUs) approaches 2 Gflops. Of course this is possible only in a computational task that is extremely parallelizable with minimal inter-node communication, no cache misses, etc. In the year 2000, the cost of our entire cluster was about US$15,000, including the switch. This means that the cost for computational speed was about US$7.50/Mflop. (Eliminating less essential hardware such as CDROM drives, display cards and floppy drives, and using smaller hard disks on the subordinate nodes, would further reduce this number.) It is instructive to compare this to other high performance computers. One example is a Cray T3E-1200, starting at US$630,000 for six 1200 Mflop processors, for which the cost is about US$87.50 per Mflop. The Cray is more expensive by an order of magnitude. Clearly there are advantages in communication speed and other performance factors in the Cray that may make it more suitable for some types of problems. However, this simple calculation shows that PC clusters are an affordable way for smaller research groups or commercial interests to obtain a high performance computer.
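The cost comparison above is simple arithmetic and can be checked with a short sketch. The function name is hypothetical; the prices, CPU counts and peak speeds are the ones quoted in the text.

```python
def cost_per_mflop(total_cost_usd, n_cpus, mflops_per_cpu):
    """Price divided by aggregate peak speed, in US$ per Mflop."""
    return total_cost_usd / (n_cpus * mflops_per_cpu)

# Figures quoted in the text.
cluster = cost_per_mflop(15000, n_cpus=20, mflops_per_cpu=100)   # 7.5
cray = cost_per_mflop(630000, n_cpus=6, mflops_per_cpu=1200)     # 87.5
ratio = cray / cluster
```

The ratio comes out between 11 and 12, which is the "order of magnitude" referred to above. Both numbers are based on peak speeds, so they say nothing about sustained performance on a communication-heavy problem.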

A widely used lattice QCD simulation program is the MILC (MIMD Lattice Collaboration) code [19]. It has timing routines provided so that one can use the parallelized conjugate gradient (CG) routine for inverting the fermionic matrix in the simulation as a benchmark. Furthermore, this code is very versatile and is designed to be run on a wide variety of computers and architectures. This enables quantitative comparison of our cluster to both other clusters and commercial supercomputers. In the MILC benchmark test we ran to a convergence tolerance of $10^{-5}$ per site. For consistency with benchmarks performed by others, we simulated Kogut-Susskind fermions given by Eq. (4).

We illustrate the result of the MILC code benchmark test in Figure 2. This figure deserves some explanation. We have run the benchmark test for different lattice sizes and different numbers of processors. It is useful to look at how performance is affected by the number of CPUs when the amount of data per CPU is held fixed, that is, when each CPU is responsible for a section of the lattice that has $L^4$ sites. For one CPU, the size of the total lattice is $L^4$. For two CPUs it is $L^3 \times 2L$. For four CPUs the total lattice is $L^2 \times (2L)^2$; for eight CPUs, $L \times (2L)^3$; and for 16 CPUs the total size of the lattice is $(2L)^4$.

L     Single processor speed (Mflops)
4     161.5
6     103.2
8     78.6
10    76.4
12    73.9
14    75.9

Table 3: Summary of single CPU performance.

Note that the falloff in performance with increased number of CPUs is dramatic. This is because inter-processor message passing is the slowest portion of this or any MPI program, and from two to sixteen CPUs the amount of communication per processor increases by a factor of four. Table 2 shows that for a lattice divided into $2^j$ hypercubes, each of size $L^4$, there will be j directions in which the CPUs must pass data to their neighbors. The amount of communication each processor must perform is proportional to the amount of interface per processor. As this increases, per-node performance decreases until j = 4, when every lattice dimension has been divided (for a d = 4 simulation); beyond that point the per-processor performance should remain constant as more processors are added. The shape of this decay is qualitatively consistent with a 1/j falloff.
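The boundary sizes in Table 2 follow from simple counting, which the following sketch reproduces. The function name and the choice L = 8 are illustrative assumptions; the formulas are those of the table header ($2^j$ CPUs, $jL^3$ interface per CPU).

```python
def decomposition(j, L):
    """Boundary sizes when a lattice is cut in j of its 4 directions.

    Each of the 2**j CPUs holds an L**4 hypercube and exchanges one
    L**3 face per divided direction (counted as in Table 2).
    """
    n_cpus = 2 ** j
    dims = [L] * (4 - j) + [2 * L] * j      # total lattice dimensions
    interface_per_cpu = j * L ** 3
    total_interface = n_cpus * interface_per_cpu
    return n_cpus, dims, total_interface, interface_per_cpu

rows = [decomposition(j, L=8) for j in range(5)]
```

Going from j = 1 (two CPUs) to j = 4 (sixteen CPUs) multiplies the interface per CPU by four, which is the factor quoted above.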

Of course there are other ways to divide a four-dimensional lattice. The goal of a particular simulation will dictate the geometry of the lattice and therefore the most efficient way to divide it up (generally minimizing communication). A four-CPU simulation using a $4L \times L^3$ lattice has the four hypercubic lattice sections lined up in a row (as opposed to a $2 \times 2$ square for an $L^2 \times (2L)^2$ lattice) and has the same amount of communication per CPU as does the $L^3 \times 2L$ two-CPU simulation. In a benchmark test the per-CPU performance was comparable to the performance in the two-CPU test.

For a single processor, there is a general decrease in performance as L increases, as shown in Tab. 3. This is well explained in [20] as due to the larger matrix size using more space outside of the cache memory, causing slower access to the data.

For multiple CPUs there is a performance improvement as L is increased. The explanation for this is that the communication bandwidth is not constant with respect to message size, as Fig. 3 shows. For very small message sizes, the bandwidth is very poor. It is only with messages of around 10kB or greater that the bandwidth reaches the full potential of the fast Ethernet hardware, nearly 100Mbit/s. With a larger L, the size of the messages is also larger, improving the communication efficiency. The inter-node communication latency for our system is 102 μs. As inter-node communication is the slowest part of a parallel program, this far outweighs the effect of cache misses.
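The dependence of bandwidth on message size can be illustrated with a simple latency-plus-transfer-time model. This model is an assumption made for illustration, not a fit to Fig. 3; it uses the 102 μs latency and the 100Mbit/s peak rate quoted in the text, and the function name is hypothetical.

```python
def effective_bandwidth(message_bytes, latency_s=102e-6, peak_bits_per_s=100e6):
    """Effective bandwidth (bit/s) assuming time = latency + size / peak rate."""
    bits = 8 * message_bytes
    return bits / (latency_s + bits / peak_bits_per_s)

small = effective_bandwidth(100)       # 100-byte message
large = effective_bandwidth(10_000)    # 10 kB message
```

In this model a 100-byte message achieves well under 10Mbit/s because the fixed latency dominates, while a 10 kB message already reaches most of the peak rate. The trend is the same as the one described above, although the real curve also depends on protocol overheads that the model ignores.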



5 Physics Results 

5.1 Green functions 

Calculation of hadron spectroscopy remains an important task of non-perturbative studies of QCD using lattice methods. In this paper, we present the spectrum results of light hadrons and the $1^{-+}$ hybrid meson with quenched Wilson fermions. The $\pi$ and $\rho$ mesons consist of a quark and an anti-quark, and the proton and $\Delta^{++}$ of three quarks; their operators are given by:

$$O_\pi(x) = \bar d_{s_1c}(x)\,\gamma_{5,s_1s_2}\,u_{s_2c}(x),$$

$$O_{\rho_k}(x) = \bar d_{s_1c}(x)\,\gamma_{k,s_1s_2}\,u_{s_2c}(x),$$

$$O_{p,s_1}(x) = \epsilon_{c_1c_2c_3}(C\gamma_5)_{s_2s_3}\,u_{s_1c_1}(x)\left(u_{s_2c_2}(x)d_{s_3c_3}(x) - d_{s_2c_2}(x)u_{s_3c_3}(x)\right),$$

$$O_{\Delta,s_1}(x) = \epsilon_{c_1c_2c_3}(C\gamma_k)_{s_2s_3}\,u_{s_1c_1}(x)u_{s_2c_2}(x)u_{s_3c_3}(x), \qquad (7)$$

where u and d are the "up" and "down" quark fields, C is the charge conjugation matrix, c is the color index of the Dirac field, and s is the Dirac spinor index. Summation over repeated indices is implied. The correlation function of a hadron is:

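$$C_h(t) = \sum_{\vec x}\langle O_h^\dagger(\vec x,t)\,O_h(\vec 0,0)\rangle, \qquad (8)$$

where $O_h(\vec x,t)$ is a hadron operator given in Eq. (7). Then,

$$C_{\pi^+}(t) = -\Big\langle\sum_{\vec x}\mathrm{Tr}_{sc}\left(\gamma_5 Q_d(x,0)\gamma_5 Q_u(0,x)\right)\Big\rangle,$$

$$C_{\rho_k}(t) = -\Big\langle\sum_{\vec x}\mathrm{Tr}_{sc}\left(\gamma_k Q_d(x,0)\gamma_k Q_u(0,x)\right)\Big\rangle,$$

$$C_p(t) = \epsilon_{c_1c_2c_3}\epsilon_{c_4c_5c_6}(C\gamma_5)_{s_3s_4}(C\gamma_5)_{s_5s_6}\left[Q_{s_1c_1,s_2c_4}(x,y)Q_{s_3c_2,s_5c_5}(x,y)Q_{s_4c_3,s_6c_6}(x,y) + Q_{s_1c_1,s_5c_4}(x,y)Q_{s_3c_2,s_2c_5}(x,y)Q_{s_4c_3,s_6c_6}(x,y)\right],$$

$$C_{\Delta^{++}}(t) = \epsilon_{c_1c_2c_3}\epsilon_{c_4c_5c_6}(C\gamma_k)_{s_3s_4}(C\gamma_k)_{s_5s_6}\left[Q_{s_1c_1,s_2c_4}(x,y)Q_{s_3c_2,s_5c_5}(x,y)Q_{s_4c_3,s_6c_6}(x,y) + 2Q_{s_1c_1,s_5c_4}(x,y)Q_{s_3c_2,s_2c_5}(x,y)Q_{s_4c_3,s_6c_6}(x,y)\right], \qquad (9)$$

where $\mathrm{Tr}_{sc}$ stands for a trace over spin and color. In Tab. 4 we list the operator for the P-wave $a_1$ meson, which is also made of a quark and an anti-quark.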

Hybrid (exotic) mesons, which are important predictions of quantum chromodynamics (QCD), are states of quarks and anti-quarks bound by excited gluons. First-principles lattice studies of such states would help us understand the role of "dynamical" color in low energy QCD and provide valuable information for experimental searches for these new particles. In Tab. 4 the operator of the $1^{-+}$ meson is given.

For sufficiently large values of t and the lattice time period T, the correlation function is expected to approach the asymptotic form:

$$C_h(t) \to Z_h\left[\exp(-m_h a t) + \exp(m_h a t - m_h a T)\right]. \qquad (10)$$

Table 4: Source and sink operators for $a_1$ and hybrid mesons.

Volume             $\beta$   Warmup   Stored configs.
$8^3 \times 32$    5.7       200      200
$8^3 \times 32$    5.85      200      200
$12^3 \times 36$   6.25      200      200
$16^3 \times 32$   6.25      600      600

Table 5: Simulation parameters

Fitting Eq. (10) at large t, the effective mass of a hadron $am_h$ is obtained.
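As an illustration of how a mass can be read off from the asymptotic form of Eq. (10), the sketch below builds a synthetic correlator and recovers the mass point by point. It relies on the identity $C(t-1) + C(t+1) = 2\cosh(am_h)\,C(t)$, which holds exactly for Eq. (10). This is an assumed alternative to the fitting routine used in this work, and the function names and input values are hypothetical.

```python
import math

def correlator(t, mass, T, Z=1.0):
    """Asymptotic form of Eq. (10), with the mass in lattice units."""
    return Z * (math.exp(-mass * t) + math.exp(mass * t - mass * T))

def effective_mass(c_prev, c_mid, c_next):
    """Solve cosh(m) = (C(t-1) + C(t+1)) / (2 C(t)) for m."""
    return math.acosh((c_prev + c_next) / (2.0 * c_mid))

T, true_mass = 32, 0.379                       # illustrative values
C = [correlator(t, true_mass, T) for t in range(T)]
m_eff = [effective_mass(C[t - 1], C[t], C[t + 1]) for t in range(1, T - 1)]
```

For this noise-free single-state input the extracted value is the same at every t. With real data, excited-state contamination at small t and statistical noise restrict the usable range, which is why a fit over a chosen time window is needed.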



5.2 Light hadron masses 

We updated the pure SU(3) gauge fields with the Cabibbo-Marinari quasi-heat bath algorithm, each iteration followed by 4 over-relaxation sweeps. The simulation parameters are listed in Tab. 5. The separation between two successive stored configurations is 100 iterations. The auto-correlation time was computed to make sure that these configurations are independent.

The u and d quarks are assumed to be degenerate. Using the CG algorithm, the quark propagators in Eq. (6) are calculated by inverting the Dirac matrix with preconditioning via ILU decomposition by checkerboards. The convergence tolerance we set is $5 \times 10^{-8}$. To extract masses from the hadron propagators, we must average the correlation function in Eq. (8) of the hadron over the ensemble of gauge configurations, and use a fitting routine to evaluate $am_h$ in Eq. (10).

The quenched simulations were performed at lattice couplings of $\beta = 5.7$ and $\beta = 5.85$ on the $8^3 \times 32$ lattice. We compared the results with those by MILC and GF11. At $\beta = 6.25$, we computed the light meson and baryon masses on the $12^3 \times 36$ and $16^3 \times 32$ lattices. The data for $\beta = 6.25$ have been reported in Ref. [23]. Here we detail the results for $\beta = 5.7$ and $\beta = 5.85$.

In Fig. 4 we show the pion correlation function at $\beta = 5.85$ and $\kappa = 0.1585$. In selecting the time range to be used in the fitting, we have tried to be systematic. We choose the best fitting range by maximizing the confidence level of the fit and optimizing $\chi^2$/d.o.f.

Particle   Group   Configs   Lattice            $t_{min}$   $t_{max}$   Mass        $\chi^2$/dof   C.L.
$\pi$      MILC    90        $16^3 \times 32$   7           16          0.378(2)    12.38/8        0.135
$\pi$      ZSU     200       $8^3 \times 32$    8           16          0.379(6)    8.74/7         0.163
$\rho$     MILC    90        $16^3 \times 32$   8           16          0.530(3)    2.857/7        0.898
$\rho$     ZSU     200       $8^3 \times 32$    8           16          0.533(7)    4.67/7         0.216
proton     MILC    90        $16^3 \times 32$   7           16          0.783(10)   8.339/8        0.401
proton     ZSU     200       $8^3 \times 32$    8           16          0.796(19)   11.36/7        0.112
$\Delta$   MILC    90        $16^3 \times 32$   8           16          0.852(11)   9.302/7        0.232
$\Delta$   ZSU     200       $8^3 \times 32$    8           16          0.857(13)   16.51/7        0.023

Table 6: Effective masses of light hadrons at $\beta = 5.85$ and $\kappa = 0.1585$ on the lattice $16^3 \times 32$ (MILC) and $8^3 \times 32$ (ZSU, this work).

A point source means a delta function, and a smeared source means a spread-out distribution (an approximation to the actual wave-function of the quantum state). For example, the simplest operator for a meson is just $O_h(x) = \bar q(x)q(x)$, i.e. the product of quark and anti-quark fields at a single point. A disadvantage of this point source is that the operator creates not only the lightest meson, but all possible excited states too. To write down an operator which creates more of the single state, one must "smear" the operator out, e.g.

$$O_h(x) = \sum_{\vec y}\bar q(\vec x)f(\vec x-\vec y)q(\vec y), \qquad (11)$$

where $f(\vec x)$ is some smooth function. Here we choose

$$f(\vec x) = N\exp(-|\vec x|^2/r_0^2), \qquad (12)$$

with N a normalization factor. The size of the smeared operator should generally be comparable to the size of the hadron created. There is no automatic procedure for tuning the smearing parameter $r_0$; one simply has to experiment with a few choices. We plot in Figs. 5, 6, 7 and 8, respectively, the effective mass of the $\pi$, $\rho$, proton and $\Delta$ particles as a function of time t at $\beta = 5.85$ and $\kappa = 0.1585$ on the $8^3 \times 32$ lattice. As one sees, the plateau from which one can estimate the effective mass is very narrow for the point source, for the reason mentioned above. When a smeared source is used, the width of the plateau changes with the smearing parameter $r_0$. We tried many values of $r_0$ and found that when $r_0 > 16$, the effective mass is almost independent of $r_0$ and we observe the widest plateau. These figures imply that the smearing technique plays a more important role for heavier hadrons in suppressing the contamination from excited states. Furthermore, one has to do a careful study using the smearing technique before doing simulations on a larger lattice.
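To make the role of $r_0$ concrete, the Gaussian of Eq. (12) can be tabulated on a small periodic spatial lattice. The sketch below is only illustrative: measuring distances by the shortest periodic separation and fixing N through $\sum_{\vec x} f(\vec x) = 1$ are assumed conventions, and the function name is hypothetical.

```python
import math

def gaussian_smearing(L, r0):
    """Return f(x) = N exp(-|x|^2 / r0^2) on a periodic L^3 lattice.

    Distances use the shortest periodic separation from the origin, and
    N is fixed by requiring sum_x f(x) = 1 (an assumed convention).
    """
    def wrap(d):
        return min(d, L - d)
    weights = {}
    for x in range(L):
        for y in range(L):
            for z in range(L):
                r2 = wrap(x) ** 2 + wrap(y) ** 2 + wrap(z) ** 2
                weights[(x, y, z)] = math.exp(-r2 / r0 ** 2)
    norm = sum(weights.values())
    return {site: w / norm for site, w in weights.items()}

narrow = gaussian_smearing(L=8, r0=1.0)    # close to a point source
wide = gaussian_smearing(L=8, r0=18.0)     # spread over the whole volume
```

Under these assumptions, a value of $r_0$ much larger than the spatial extent gives weights that are nearly uniform over the $8^3$ volume, so increasing $r_0$ further changes the source very little.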

We show the effective masses $am_h$ of the light hadrons in Tab. 6, with smearing parameter $r_0 = 18$. The best fits to a range of points run from $t_{min} = 8$ to $t_{max} = 16$. The masses are in good agreement with the previous MILC results on the $16^3 \times 32$ lattice [21]. This means that finite size effects are small at this $\beta$ and $\kappa$.

In Figs. 9 and 10 we compare our results (with $r_0 = 22$) for the $\pi$ mass squared, $\rho$ mass, proton mass and $\Delta$ mass as a function of $1/\kappa$ for $\beta = 5.7$ with GF11 [22] on the same $8^3 \times 32$ lattice. The GF11 collaboration has 2439 configurations. Most results are consistent.

To determine the relation between the lattice spacing a and coupling $\beta$, one has to input the experimental value of a hadron mass (see Ref. [24] for details).



Group   $\kappa$   Configs   Lattice            Source $\to$ Sink       Fit range   $\chi^2$/dof   Mass
MILC    0.1450     23        $20^3 \times 48$   $a_1(P) \to a_1(P)$     6-11        1.7/4          1.312(8)
                                                $1^{-+} \to 1^{-+}$     4-10        3.5/5          1.88(8)
                                                $Q^4 \to 1^{-+}$        3-7         0.7/3          1.65(5)
ZSU     0.1450     120       $8^3 \times 32$    $a_1(P) \to a_1(P)$     6-11        1.5/4          1.318(6)
                                                $1^{-+} \to 1^{-+}$     4-10        4.2/6          1.87(10)
                                                $Q^4 \to 1^{-+}$        3-7         1.2/3          1.65(2)

Table 7: Effective masses for the ordinary P-wave $a_1$ meson and the exotic $1^{-+}$ meson for $\beta = 5.85$, compared between MILC and ZSU.



5.3 $a_1(P)$ and $1^{-+}$ hybrid meson masses

At $\beta = 5.85$ on the $8^3 \times 32$ lattice, 120 stored pure gauge configurations (see Tab. 5) were re-used to study the $a_1(P)$ and $1^{-+}$ hybrid meson masses. The BiCGstab algorithm was employed to compute the quark propagators with Wilson fermions, and the residual is of $O(10^{-7})$. Then we computed the correlation function using the sources and sinks in Tab. 4, from which the effective mass is extracted. Our results at $\kappa = 0.1450$ and $r_0 = 16$ are listed in Tab. 7 and compared with the MILC data.



6 Conclusions 

We have demonstrated that a parallel cluster of PC type computers is an economical way to build 
a powerful computing resource for academic purposes. On an MPI QCD benchmark simulation 
it compares favorably with other MPI platforms. 

We also present results for the light hadrons and the $1^{-+}$ hybrid meson from lattice QCD. Such large scale simulations have usually required super computing resources, but here they were all done on our PC cluster. A more careful and systematic study of the smearing method is made. Our results for $\beta = 5.7$ and 5.85 are consistent with the data obtained on supercomputers by other groups on the same or larger lattices. This implies that finite size effects are small at these $\beta$ values.

To compare the lattice results with experiment, one needs to do simulations at larger $\beta$ and carefully study the lattice spacing errors. According to the literature, there are strong finite size effects for the Wilson action at $\beta > 6.0$ and a very large lattice volume is required. In this respect, it is more efficient to use the improved action, and some progress has been reported in Refs. [25, 26].

In conclusion, we are confident that ZSU's Pentium cluster can provide a very flexible and 
extremely economical computing solution, which fits the demands and budget of a developing 
lattice field theory group. We are going to use the machine to produce more useful results of 
non-perturbative physics. 

This work was in part based on the MILC collaboration's public lattice gauge theory code (see Ref. [19]). We are grateful to C. DeTar, S. Gottlieb and D. Toussaint for helpful discussions.

Figure 1: Schematic diagram of a parallel cluster.

Figure 2: Performance (in Mflops) per CPU versus the number of CPUs in the MILC QCD code benchmark.

Figure 3: Communications bandwidth vs. message size.

Figure 4: Green function of the $\pi$ at $\beta = 5.85$ and $\kappa = 0.1585$ on the $8^3 \times 32$ lattice. The error bars represent statistical errors determined by the jackknife method.

Figure 5: $\pi$ effective mass fits to the correlation function at $\beta = 5.85$ and $\kappa = 0.1585$ on the $8^3 \times 32$ lattice. Data for the point source and for the smeared source with $r_0 = 1$ and $r_0 = 18$ are labeled by circles and squares respectively (from top to bottom).

Figure 6: The same as Fig. 5 but for the $\rho$ particle.

Figure 7: The same as Fig. 5 but for the proton.

Figure 8: The same as Fig. 5 but for the $\Delta$ particle.

Figure 9: Pion mass squared as a function of $1/\kappa$ for $\beta = 5.7$. ZSU's and GF11's results are labeled by circles and squares.

Figure 10: Effective mass of the $\rho$ (GF11: triangle left, ZSU: diamond), proton (GF11: triangle down, ZSU: circle) and $\Delta$ (GF11: triangle up, ZSU: square) as a function of $1/\kappa$ for $\beta = 5.7$. The points at the smallest value of $1/\kappa$ are the ZSU results extrapolated to the chiral limit.